2. Data exploration

Set up

Let's import the utilities from the previous notebook and unpickle the cleaned dataset.
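A minimal sketch of the unpickling step, assuming the cleaned dataset was saved as a pandas DataFrame. The file name `cleaned_dataset.pkl` and the helper `load_cleaned` are hypothetical, since the actual path used by the previous notebook is not given here.

```python
import pandas as pd

# Hypothetical path: the real file name from the previous notebook is not stated.
CLEANED_PATH = "cleaned_dataset.pkl"

def load_cleaned(path=CLEANED_PATH):
    """Load a pickled DataFrame. Only unpickle files from a trusted source."""
    return pd.read_pickle(path)
```

Calling `load_cleaned()` with the correct path would return the cleaned DataFrame for the rest of the exploration.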

'Credit_Limit' and 'Avg_Open_To_Buy' have a correlation of 99.6%, so the two features carry nearly the same information. Therefore the latter feature can be dropped.
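One way to find such redundant pairs is to scan the correlation matrix for entries above a threshold. The sketch below uses synthetic toy data, not the real dataset. The helper `correlated_pairs`, the 0.99 threshold, and the way the toy 'Avg_Open_To_Buy' column is constructed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.99):
    """Return (first, second, r) for numeric column pairs with |r| above threshold."""
    corr = df.select_dtypes("number").corr()
    cols = list(corr.columns)
    return [
        (a, b, float(corr.loc[a, b]))
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
        if abs(corr.loc[a, b]) > threshold
    ]

# Synthetic toy data (assumption): the third column is built from the first two
# so that it is almost perfectly correlated with 'Credit_Limit'.
rng = np.random.default_rng(0)
credit = rng.uniform(1500, 35000, size=500)
revolving = rng.uniform(0, 2500, size=500)
toy = pd.DataFrame({
    "Credit_Limit": credit,
    "Total_Revolving_Bal": revolving,
    "Avg_Open_To_Buy": credit - revolving,
})

pairs = correlated_pairs(toy)
# Drop the second ("latter") column of every highly correlated pair.
reduced = toy.drop(columns=[second for _, second, _ in pairs])
print(pairs)
print(list(reduced.columns))
```

Dropping the second member of each pair is an arbitrary convention; either column could be kept.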

We will plot the column distributions using the scatter_matrix() function, which produces an array of pairwise plots.

One can observe various patterns, but no considerable correlations (as expected from the correlation matrix), except for "Months on Book" vs. "Customer Age" (correlation of 79%). Moreover, the dotted lines visible in some plots indicate the binned nature of the corresponding variables.
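The scatter matrix discussed above could be produced along these lines, assuming the function in question is `pandas.plotting.scatter_matrix`. The data here is a synthetic toy frame, and the column names and value ranges are assumptions chosen only to mimic a correlated age/tenure pair and integer-valued (binned) variables.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic toy data (assumption): integer columns, one pair strongly correlated.
rng = np.random.default_rng(0)
age = rng.integers(26, 74, size=300)
toy = pd.DataFrame({
    "Customer_Age": age,
    "Months_on_book": np.clip(age - 10 + rng.integers(-6, 7, size=300), 13, 56),
    "Total_Trans_Ct": rng.integers(10, 140, size=300),
})

# Histograms on the diagonal, pairwise scatter plots elsewhere.
axes = scatter_matrix(toy, figsize=(8, 8), diagonal="hist", alpha=0.3)
```

Because the toy columns take only integer values, the points fall on a regular grid, which is the kind of pattern a binned variable produces in these plots.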

Features that distinguish the two categories of customers: ['Total_Relationship_Count', 'Months_inactive_12_mon', 'Contacts_Count_12_mon', 'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']
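A simple way to check whether a feature separates the two customer categories is to compare a per-group summary statistic. The sketch below is only an illustration, not necessarily how the list above was obtained. The target column name 'Attrition_Flag', its labels, and all values in the toy frame are assumptions.

```python
import pandas as pd

# Toy data (assumption): hypothetical target column and made-up values.
toy = pd.DataFrame({
    "Attrition_Flag": ["Existing", "Existing", "Attrited", "Attrited"],
    "Total_Trans_Ct": [70, 80, 40, 50],
    "Total_Revolving_Bal": [1500, 1300, 0, 600],
})
features = ["Total_Trans_Ct", "Total_Revolving_Bal"]

# Median of each feature within each customer category.
medians = toy.groupby("Attrition_Flag")[features].median()
print(medians)
```

Features whose group medians (or full distributions) differ clearly between the categories are candidates for the list above.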

The credit limit distribution has a peak in the highest bin. This may be the maximum credit limit offered by the bank.
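The capped-limit hypothesis can be probed by measuring how many customers sit exactly at the maximum value. This is a minimal sketch on made-up numbers; the helper `share_at_max` and the toy values are assumptions.

```python
import pandas as pd

def share_at_max(values):
    """Fraction of entries equal to the maximum value."""
    s = pd.Series(values)
    return float((s == s.max()).mean())

# Made-up credit limits (assumption), with several entries at the same cap.
toy_limits = [1500.0, 3000.0, 9000.0, 30000.0, 30000.0, 30000.0]
share = share_at_max(toy_limits)
print(share)
```

A large share of identical values at the maximum would be consistent with a bank-imposed cap, whereas a smooth tail would not.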